1  Analysis with Polars

Important

This book is a first draft, and I am actively collecting feedback to shape the final version. Let me know if you spot typos, errors in the code, or unclear explanations; your input would be greatly appreciated, and your suggestions will help make this book more accurate, readable, and useful for others. You can reach me at:
Email:
LinkedIn: www.linkedin.com/in/jorammutenge
Datasets: Download all datasets

Polars was needed because the current state of dataframe implementations didn’t use the knowledge available in database and query engine research.

Ritchie Vink


If you’re reading this book, you’re likely interested in learning how to analyze data using Polars. You might already have experience with tools such as SQL, pandas, or Spark, or you might be completely new to data analysis. Regardless of your background, this chapter establishes the foundation for the topics covered throughout the book and ensures we share a common vocabulary. I’ll begin by explaining what data analysis is, then introduce Polars: what it is, why it’s gaining popularity, how it compares to other tools, and how it fits into the broader process of data analysis.

1.1 What Is Data Analysis?

Data analysis is the process of transforming raw data into meaningful insights that guide decisions, drive innovation, and improve performance across industries. Its growing importance is reshaping the job market and redefining how organizations operate.

Data analysis refers to the systematic examination of data to uncover patterns, trends, and relationships. It involves techniques such as statistical modeling, data visualization, and computational algorithms to interpret large volumes of information. Whether it’s customer behavior, financial performance, or operational efficiency, data analysis helps convert complex datasets into actionable knowledge. This process is foundational in fields ranging from healthcare and finance to marketing and public policy.

The importance of data analysis has surged in recent years due to the exponential growth of digital data. With the rise of social media, e-commerce, and IoT devices, organizations now collect vast amounts of information daily. Without analysis, this data remains untapped potential. Businesses use data analytics to predict market trends, personalize customer experiences, and optimize supply chains. In a competitive landscape, the ability to make informed decisions quickly is a strategic advantage, making data analysis indispensable.

In today’s job market, roles associated with data analysis are among the most in-demand. Positions such as data analyst, business intelligence analyst, data scientist, analytics engineer, and machine learning engineer are increasingly common across industries. These roles require proficiency in tools like SQL, R, pandas, and now Polars. Business Intelligence (BI) tools like Tableau or Power BI are also used. Employers value candidates who not only possess strong data manipulation skills but also excel at translating complex findings into clear, compelling insights for stakeholders. In my professional experience, I’ve witnessed excellent analyses fail to gain traction with senior leadership simply because they were presented ineffectively. The ability to communicate data-driven insights in a way that resonates with decision-makers is just as critical as the analysis itself. The demand for data analysis skills reflects a broader shift toward data-driven cultures in organizations.

Data analysis profoundly influences decision making by enabling evidence-based strategies. Instead of relying on intuition or outdated assumptions, leaders can evaluate real-time data to guide their choices. For example, marketing teams use analytics to segment audiences and tailor campaigns, while operations managers rely on predictive models to forecast demand and manage inventory. This analytical approach reduces risk, enhances agility, and fosters innovation. As organizations continue to embrace digital transformation, data analysis will remain central to strategic planning and execution.

1.2 Why Polars?

This section explores Polars as a data analysis library, highlighting its key advantages, comparing its performance and usability to pandas, and examining how Polars integrates into the broader analytical workflow.

1.2.1 What Is Polars?

Polars is a blazingly fast dataframe library for manipulating structured data.1 It is not a programming language but a library that can be used within other languages such as Python, Rust, and R. Created by Ritchie Vink, Polars was first released in 2020 and has since evolved through contributions from a growing community of developers. I’ll use Polars in the Python programming language throughout the book.

According to the Polars user guide, the library is built on a philosophy of delivering lightning-fast performance and efficiency in data analysis. It achieves this by leveraging all available CPU cores, optimizing queries to minimize unnecessary computation and memory usage, and handling datasets that exceed the size of your system’s RAM. Polars also provides a consistent and predictable API, built around a strict schema where data types are known before execution, ensuring reliability and clarity in analytical workflows. Written in Rust, it benefits from C/C++-level performance and precise control over performance-critical aspects of its query engine, making it a powerful tool for modern data processing.

Understanding the Polars data model is essential to using the library effectively. At its core, Polars provides two main data structures: the Series and the DataFrame, which are used for working with array data and tabular data, respectively. A dataframe is a collection of series, and each series is a one-dimensional structure. Certain rules apply to the series within a dataframe: they must all have the same length, each must have a unique name, and all elements within a series must share the same data type. While dataframes and series share some methods (functions that can be called on them), many are specific to each structure.

Tip

A dataframe can consist of a single column; it does not need to have many.

Reading data from a CSV file with Polars creates a dataframe. Polars automatically tries to infer the data type for each column. If it cannot determine the type, it assigns String as the default. In addition to CSV, Polars can read other file formats such as Parquet, Avro, and Excel. It can also load data directly from a database, Delta Lake, or even from data copied to your clipboard. Likewise, Polars can write data to all of these formats and more.

One of Polars’ most distinctive features is its use of expressions. Expressions are composable building blocks that define operations and transformations on dataframes. Instead of manipulating raw data directly, you create expressions that describe what you want to compute. For example, pl.col('Price') * pl.col('Discount') multiplies the Price and Discount columns in a dataframe. With expressions, you can target specific columns, apply functions, filter rows, and perform aggregations while maintaining clear and declarative syntax. These expressions can range from simple to highly complex, depending on your needs.

The power of expressions lies in their ability to chain operations together. By combining multiple expressions, you can build efficient data pipelines from simple, reusable components. For instance, you might start by selecting columns, apply a transformation such as a logarithm or normalization, filter rows based on a condition, and then group and aggregate, all within a single expression chain. Much of the code in this book follows this method-chaining style. This modular approach promotes clean, readable code and makes it easier to debug or extend data pipelines. More importantly, it leverages the Polars query engine, which analyzes your code and constructs an optimized execution plan. As a result, chaining expressions can significantly improve the execution speed of your code.

1.2.2 Benefits of Polars

Compared to other dataframe libraries such as pandas, Dask, PySpark, or Ibis, Polars is relatively new. However, its growing popularity is one of the main reasons to consider it as your primary data analysis tool. I predict that soon, more companies will begin listing Polars knowledge as a desirable, if not required, skill for data professionals.

The two strongest reasons to use Polars are its speed and its simple, intuitive syntax. One of the most common sayings in the Polars community is, “I came for the speed, I stayed for the syntax.” This perfectly reflects my own experience. Compared to pandas, the library I first learned, Polars code reads almost like plain English. You’ll see how natural it feels when we explore its structure in Chapter 2.

In addition to its query optimizer, Polars achieves high performance through multithreading, which allows it to use all available CPU cores on your machine. The Polars Decision Support (PDS) benchmark compares Polars to other data analysis libraries in terms of execution speed, where lower times indicate better performance. Table 1.1 shows the latest PDS-H benchmark results at scale factor 10 (SF-10), where one unit of scale factor represents roughly 1 GB of CSV data. PDS-H supports running queries across multiple data sizes.

Table 1.1: PDS-H benchmark at SF-10: total execution time per solution.
Solution                      Total time (sec)   Factor
polars[streaming]-1.30.0                  3.89      1.0
duckdb-1.3.0                              5.87      1.5
polars[in-memory]-1.30.0                  9.68      2.5
dask-2025.5.1                            46.02     11.8
pyspark-4.0.0                           120.11     30.9
pandas-2.2.3                            365.71     94.0

Another advantage of Polars is its expression API, which makes it highly extensible. Polars encourages a method-chaining style of coding, made possible through expressions. This approach allows you to build complex data pipelines by chaining expressions together. The expression API has been so effective that pandas will adopt similar functionality in version 3.0.2

Polars is also easy to learn. Adopting a new technology can be challenging, especially if you’re already familiar with another tool. Fortunately, Polars has a straightforward and intuitive syntax that makes the code easy to read and reason about. When I primarily wrote pandas code, it often took me half an hour or more to recall what my code from the previous week was doing. Polars code, on the other hand, reads like plain English. Now it only takes me a few minutes to understand my older Polars scripts. At first glance, the Polars user guide might seem overwhelming because of the long list of methods and expressions. The good news is that you don’t need to learn everything at once. Start with the general methods and expressions, then expand your knowledge over time. In my experience, learning sticks best when you encounter a real problem and then seek out the method that solves it, rather than trying to memorize a long list in advance.

Finally, Polars has an active and supportive community. The Polars team frequently updates the codebase, continually improving the library. You can raise issues or report bugs and expect quick responses. The Polars Discord community is also an excellent place to ask questions. I’ve asked several myself and have always been impressed by how quickly and thoughtfully people respond. I’ve also learned a great deal from discussions and solutions shared by other members, so even observing the conversations in the Discord channel can be an effective way to learn.

1.2.3 Polars versus pandas

The de facto data analysis library in the data field is pandas. It provides many of the same functionalities as Polars and even more in some areas. Both libraries also offer unique features that set them apart. In general, almost any analysis you can perform with pandas can also be done with Polars. However, several important differences distinguish the two.

The most significant difference between Polars and pandas is the lazy API (lazy mode), which is unique to Polars. When you use the lazy API, Polars does not execute each query line by line. Instead, it processes the entire query from start to finish as a single operation. Using the lazy API is recommended because:

  1. It allows Polars to automatically optimize queries through its query optimizer.
  2. It enables working with datasets larger than your system’s memory through streaming.
  3. It can detect schema errors before data processing begins.

When you read a dataset in lazy mode, Polars creates a LazyFrame instead of a DataFrame. To view the output of a LazyFrame, you must call the collect method on it.

In contrast, pandas operates only in eager mode. It lacks a lazy mode, which is ironic given that the panda is often called the laziest bear. This means pandas executes code sequentially without query optimization. Additionally, pandas is limited to datasets that fit into your system’s memory.

Another major difference between the two libraries is the use of indexes. By default, pandas dataframes include an index, while Polars dataframes do not. As a result, the position of rows in a Polars dataframe may not always be consistent across runs. If you need an index in a Polars dataframe, you can create one using the with_row_index method, which assigns a sequential number to each row starting from zero.

Both Polars and pandas provide built-in plotting capabilities. In pandas, you can use the plot method, which relies on the Matplotlib library in the backend. Polars also supports the plot method, but it uses Altair as the backend plotting library. Because pandas has been around longer, more visualization libraries natively support pandas dataframes than Polars dataframes. However, support for Polars is growing, with libraries such as Plotly and hvPlot now offering native compatibility. If a visualization library, such as Plotnine, supports only pandas dataframes, you can easily convert a Polars dataframe to a pandas dataframe using the to_pandas method. Conversely, you can convert a pandas dataframe to a Polars dataframe using the from_pandas function. Both conversion functions are provided by Polars itself, not by pandas.

Another key distinction is performance. Polars is multi-threaded and uses all available CPU cores on your machine, whereas pandas is single-threaded. This makes Polars significantly faster for many operations.

The ability to have hierarchical columns in a pandas dataframe is another distinguishing feature. For example, a pandas dataframe of clothing brands could have two main columns: Premium, with sub-columns Gucci and Versace, and Basic, with sub-columns Gap and Forever21, as illustrated in Table 1.2.

Table 1.2: Hierarchical columns in a pandas dataframe.
      Premium            Basic
      Gucci   Versace    Forever21   Gap
0      1000      3000        27.95   23.99
1      2500      5000        39.98   54.85

This kind of hierarchical column structure is not supported in a Polars dataframe.
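For readers curious how Table 1.2 is built on the pandas side, here is a sketch using pandas MultiIndex columns (the values are the ones from the table):

```python
import pandas as pd

# Reproducing Table 1.2 with hierarchical (MultiIndex) columns.
cols = pd.MultiIndex.from_tuples([
    ("Premium", "Gucci"), ("Premium", "Versace"),
    ("Basic", "Forever21"), ("Basic", "Gap"),
])
df = pd.DataFrame(
    [[1000, 3000, 27.95, 23.99], [2500, 5000, 39.98, 54.85]],
    columns=cols,
)

premium = df["Premium"]  # select an entire column group at once
```

In Polars you would instead flatten such a structure into plain column names (for example "Premium_Gucci") or model the grouping as a separate categorical column.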

Finally, although both Polars and pandas are Python libraries, their cores are written in different languages: Polars is implemented in Rust, while pandas is built largely on Python and Cython, with much of its numerical work done in C through NumPy.

1.3 Data Formats Polars Can Read

Polars is a versatile library that can read and process a wide range of file formats. Data analysts frequently work with datasets stored in different formats, and Polars provides native support for many of the most common ones. Some of the file formats Polars can read include:

  • CSV – text-based files with extensions such as .csv, .txt, and .tsv.
  • Excel – spreadsheet files with extensions such as .xlsx and .xls, including macro-enabled .xlsm files.
  • JSON – files with the .json extension, commonly used for data retrieved from web APIs.
  • Parquet – files with the .parquet extension, which use columnar storage, compression, and preserve column data types.
  • Database – data stored in relational databases, which Polars can read directly through database connections.
  • Avro – files with the .avro extension, used with the Apache Avro data serialization system. These files are common in big data and distributed systems where efficient storage and data exchange are required.

Polars also supports many additional file formats beyond those listed here.

1.4 Conclusion

Data analysis is a dynamic field that supports a wide range of needs across companies and institutions. Polars provides a powerful toolkit for working with data, particularly when that data is organized in tables. Importing data into a project and performing initial exploration represents only one stage of a broader analytical process, and data professionals routinely work with many different file types along the way. With the foundations of data analysis, Polars, and common data formats now in place, the remainder of this book focuses on applying Polars across the full analytical workflow. Chapter 2 focuses on data preparation, starting with an introduction to data types and then moving on to profiling, cleaning, and shaping data. Chapters 3 through 6 present applications of data analysis, focusing on time series analysis, text analysis, anomaly detection, and experiment analysis. Chapter 7 covers techniques for developing complex datasets for further analysis in other tools. Chapter 8 introduces ten questions to increase understanding of Polars concepts. Finally, Chapter 9 concludes with a list of some additional resources to support your analytics journey.


  1. Ritchie Vink and Polars Contributors, “Polars: A Fast DataFrame Library in Rust and Python” (https://pola.rs/, 2025).↩︎

  2. The expression pd.col will be added to pandas version 3.0.↩︎